Abstract

Background: A significant interest in spatial epidemiology lies in identifying associated risk factors which enhance 
the risk of infection. Most studies, however, make no or limited use of the spatial structure of the data, or of 
possible nonlinear effects of the risk factors. 

Methods: We develop a Bayesian Structured Additive Regression model for cholera epidemic data. Model 
estimation and inference are based on a fully Bayesian approach via Markov Chain Monte Carlo (MCMC) simulations. 
The model is applied to cholera epidemic data in the Kumasi Metropolis, Ghana. Proximity to refuse dumps, density 
of refuse dumps, and proximity to potential cholera reservoirs were modeled as continuous functions; presence of 
slum settlers and population density were modeled as fixed effects, whereas spatial references to the communities 
were modeled as structured and unstructured spatial effects. 

Results: We observe that the risk of cholera is associated with slum settlements and high population density. The 
risk of cholera is equal and lower for communities with fewer refuse dumps, but variable and higher for 
communities with more refuse dumps. The risk is also lower for communities distant from refuse dumps and 
potential cholera reservoirs. The results also indicate distinct spatial variation in the risk of cholera infection. 

Conclusion: The study highlights the usefulness of Bayesian semi-parametric regression models in analyzing public 
health data. These findings could serve as novel information to help health planners and policy makers make 
effective decisions to control or prevent cholera epidemics. 

Keywords: Bayesian, Cholera, Cholera reservoir, Refuse dumps, Slums 



Background 

A significant interest in understanding the epidemiology 
of diseases lies in identifying associated risk factors 
which enhance the risk of infection, the so-called ecological
studies [1,2]. Most of these ecological studies, 
however, make no, or limited use of the spatial structure 
of the data, nor do they consider possible nonlinear 
effects of the risk factors. Thus, most studies use stand- 
ard statistical methods such as the classical and general- 
ized linear models that ignore methodological difficulties 
that arise from the nature of the data. Ali et al [3,4] 
have used logistic, simple and multiple linear regression 
models to study the spatial epidemiology of cholera in 
an endemic area of Bangladesh. Other ecological studies 
of cholera that have utilized standard statistical methods 
include Ackers et al [5], Mugoya et al [6] and Sasaki 
et al [7]. These methods when applied to spatially dis- 
tributed data present severe problems with estimating 
small area spatial effects, and simultaneously adjusting 
for other risk factors, in particular if such effects are 
nonlinear. If standard statistical methods are used to 
analyze spatially correlated data, the standard error of 
the covariate parameters is underestimated and thus the 
statistical significance is overestimated [8]. 

Generalized additive models (GAM) provide a power- 
ful class of models for modeling nonlinear effects of 
continuous covariates in regression models with non- 
Gaussian responses. Structured Additive Regression 
(STAR) models are extensions of GAM models that 
allow one to incorporate small area spatial effects, non- 
linear effects of risk factors, and the usual linear or fixed 
effects in a joint model [9]. This study applies a STAR 
modeling approach to develop a multivariate explanatory 
model for cholera. 

A cholera outbreak, once introduced in a population, is
enhanced by several environmental and/or socioeconomic
risk factors. Ali et al [3,4] identified proximity to surface
water, high population density, and low educational 
status as the important risk factors of cholera in an en- 
demic area of Bangladesh. Borroto and Martinez-Piedra 
[10] identified poverty, low urbanization, and proximity 
to coastal areas as the important geographic risk factors 
of cholera in Mexico. Sanitation is an important envir- 
onmental risk factor that predisposes inhabitants to 
cholera infection. Previous ecological studies have used 
spatial regression models to explore the dependency of 
cholera on some local measures of sanitation [11,12]. No 
attempt, however, has been made to combine all the 
identified measures of sanitation, including spatial 
effects, into a single multivariate model to examine their 
joint effects on cholera. In this study, we exploit the 
joint effects of three main spatial measures of sanitation 
identified from previous studies [11,12]. These are dens- 
ity of refuse dumps, proximity to refuse dumps and 
proximity to potential cholera reservoirs. Other risk factors
used in this study include living in slum and squatter
environments [13], and population density 
[3,4,14,15]. Living in slum and squatter environments
increases the risk of cholera infection, whereas 
high population density stresses existing sanitation systems,
thus putting people at increased risk of cholera. 

This study incorporates the effects of nonlinear risk 
factors and the usual fixed effects of some risk factors, 
while accounting for both structured and unstructured 
spatial effects. A STAR model of this type has been 
termed a geoadditive model [16,17]. The increasing
availability of disease and environmental data necessitates the 
development of such models to obtain valid and realistic 
statistical inferences that adequately describe the vari- 
ation of the disease. Proximity to dumps, density of 
dumps, and proximity to potential cholera reservoirs are 
modeled as smooth continuous functions, whereas pres- 
ence of slum settlers and population density are modeled 
as fixed effects, and spatial references to the communi- 
ties are modeled as structured and unstructured spatial 
effects. We use a fully Bayesian estimation based on 
Markov Chain Monte Carlo (MCMC) simulations using 
simple Gibbs sampling updates. Making inferences based 
on a fully Bayesian approach is preferred because the 
functionals of the posterior can be computed without 
relying on large-sample Gaussian approximations, thereby
quantifying the uncertainty in the parameters [18]. 

Methods 

Study area and cholera data 

This study is based on the 2005 cholera outbreak in 
Kumasi Metropolis, Ghana. Kumasi Metropolis is com- 
pletely urban and the most populous city in Ashanti 
Region. It is located at the intersection of latitude 6.04°N 
and longitude 1.28°W, covering an area of approximately 
220 km² (see Figure 1). Kumasi has a population of
approximately 1.2 million. Surveillance and reporting of 
the disease before 2005 were ineffective, and hence 
the existing data before 2005 have little or no spatial
information. However, with intensified surveillance and 
reporting systems during an outbreak in 2005, disease 
cases in Kumasi are available at community level spatial 
units. This makes the Kumasi area suitable for such a 
study. During the outbreak in 2005, cholera incidence 
rates ranged from 0.47 to 31.92 per 10,000 people 
(mean = 10.21, standard deviation = 6.84). 

Figure 1 Map of Ghana and neighboring countries (left), and Kumasi (right). Dots indicate the centroids of communities. 

The topographic map of the metropolis and the n = 68 
communities where cholera records are available were 
digitized. Cholera data for each community were 
extracted from disease records of the Kumasi Metropolitan
Disease Control Unit (DCU). We accessed these 
data with special permission from the Kumasi 
DCU. The centroids of the communities were used as 
the spatial references of cholera cases since residential 
addresses were not recorded during the outbreak. The 
denominator (population data) for computing community- 
specific cholera rates was obtained from the 2000 Popu- 
lation and Housing Census of Ghana [19]. 

Model specification 

For each community $i$, $i = 1, \ldots, N$, with population $P_i$, the 
observed number of cholera cases $Chol^{(O)}_i$ is assumed to 
be a realization of a random variable that follows an independent
Poisson distribution with intensity $Chol^{(E)}_i \cdot Chol^{(R)}_i$; 
thus $Chol^{(O)}_i \mid Chol^{(R)}_i \sim \mathrm{Poisson}(Chol^{(E)}_i \cdot Chol^{(R)}_i)$, where 
$Chol^{(E)}_i$ is the expected number of cholera cases and 
$Chol^{(R)}_i$ is the relative risk of cholera infection. A common
practice is to estimate $Chol^{(E)}_i$ as $Chol^{(r)} \cdot P_i$, where 
$Chol^{(r)}$ is the overall risk of cholera infection within the 
study population obtained as a weighted average of the 
community-specific rates, each weighted by their share 
in the overall population; thus: 

$$Chol^{(r)} = \sum_{i=1}^{N} \frac{Chol^{(O)}_i}{P_i} \cdot \frac{P_i}{\sum_{k=1}^{N} P_k} = \frac{\sum_{i=1}^{N} Chol^{(O)}_i}{\sum_{i=1}^{N} P_i}.$$
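
As an illustration of the calculation described above, the following minimal sketch computes the overall risk, the expected counts, and the resulting relative risks. The function name and the three-community data are hypothetical and are not taken from the Kumasi records.

```python
def relative_risks(cases, populations):
    """Overall risk, expected cases and relative risk per community.

    cases[i] is the observed count and populations[i] the population of
    community i (both hypothetical inputs in this sketch).
    """
    if len(cases) != len(populations):
        raise ValueError("cases and populations must have equal length")
    # Population-weighted mean of community rates = total cases / total population.
    overall_rate = sum(cases) / sum(populations)
    # Expected cases under a uniform risk across communities.
    expected = [overall_rate * p for p in populations]
    # Relative (excess) risk: observed divided by expected.
    rr = [o / e for o, e in zip(cases, expected)]
    return overall_rate, expected, rr


# Hypothetical example with three communities.
cases = [12, 3, 30]
populations = [10000, 6000, 14000]
overall_rate, expected, rr = relative_risks(cases, populations)
```

A relative risk above 1 marks a community with more cases than its population share would suggest.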

For ease of interpretation, we use the relative risk (also 
called excess risk) as the reference benchmark to estimate
the risk of cholera infection. We consider the triple 
$(Chol^{(R)}_i, x_i, w_i)$, $i = 1, \ldots, N$, where $Chol^{(R)}_i$ is the relative
risk of cholera infection in community $i$. The vector 
$x_i = (x_{i1}, \ldots, x_{ip})$ contains the $p$ continuous covariates 
and $w_i = (w_{i1}, \ldots, w_{ir})$ is a vector of $r$ categorical covariates.
In our study, $p = 3$ and $r = 2$. The study assumes 
that the response variable $Chol^{(R)}$ is Gaussian distributed,
i.e. $Chol^{(R)}_i \mid \eta_i, \sigma^2 \sim N(\eta_i, \sigma^2)$, with an unknown mean 
$\eta_i$ which can be expressed in the form: 

$$\eta_i = x_i'\beta + w_i'\gamma. \qquad (1)$$

Here, $\beta$ is a $p$-dimensional vector of unknown regression
coefficients for the continuous covariates $x_i$, and $\gamma$ 
is an $r$-dimensional vector of unknown regression coefficients
for the categorical covariates $w_i$.



In order to account for both the nonlinear effects of 
the continuous covariates and the spatial dependence 
of the data, a geoadditive modeling approach is required
[16]. The geoadditive model replaces the strictly 
linear predictor by a more flexible semi-parametric 
predictor as: 

$$\eta_i = f_1(x_{i,1}) + \cdots + f_p(x_{i,p}) + f_{spat}(s_i) + w_i'\gamma. \qquad (2)$$

Here, $f_1(x), \ldots, f_p(x)$ are nonlinear smooth functions 
of the continuous covariates $x_{i,1}, \ldots, x_{i,p}$ and $f_{spat}(s_i)$ is a 
function that accounts for spatial effects at each community
$s_i \in \{1, \ldots, S\}$. The spatial effect is usually a surrogate 
of unobserved influential factors, some of which may 
have a strong spatial structure and others may be 
present only locally (unstructured). To distinguish between
the two kinds of influential factors, $f_{spat}(s)$ is split 
up into a spatially correlated (smooth) part $f_{str}(s)$ and a
spatially uncorrelated (unsmooth) part $f_{unstr}(s)$, i.e. 

$$f_{spat}(s) = f_{str}(s) + f_{unstr}(s).$$

The final geoadditive model is then expressed as: 

$$\eta_i = f_1(x_{i,1}) + \cdots + f_p(x_{i,p}) + f_{str}(s_i) + f_{unstr}(s_i) + w_i'\gamma. \qquad (3)$$

This model contains $p + 2$ functions and $r$ fixed parameters
to be estimated. 

Prior distributions for covariates 

A fully Bayesian approach for modeling and inference 
requires prior assumptions for the unknown functions 
$f_j(x)$, $f_{str}(s)$, $f_{unstr}(s)$ and the fixed effect regression parameter
$\gamma$. For $\gamma$ we assume an independent diffuse prior 
$p(\gamma) \propto const$ due to the absence of any prior knowledge. 
A possible alternative choice is a weakly informative 
multivariate Gaussian distribution. 

For the continuous functions $f_j(x)$, $j = 1, \ldots, p$, we 
choose the Bayesian P(enalized)-splines [20,21]. This approach
assumes that an unknown smooth function $f_j$ of 
a covariate $x_j$ can be approximated by a polynomial 
spline of degree $l$ defined on a set of equally spaced 
knots $x_j^{min} = \zeta_{j,0} < \zeta_{j,1} < \cdots < \zeta_{j,s-1} < \zeta_{j,s} = x_j^{max}$ within 
the domain of $x_j$. Such a spline can be written in terms 
of a linear combination of $d = s + l$ basis functions 
$B_m$, i.e. 

$$f_j(x_j) = \sum_{m=1}^{d} \xi_{j,m} B_m(x_j). \qquad (4)$$
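
The basis expansion in equation (4) can be sketched with the Cox-de Boor recursion on an equally spaced knot grid extended by $l$ knots on each side. This is only an illustrative sketch: the function names are hypothetical, and the cubic degree and the 20 intervals used in the example are assumptions for the illustration rather than settings confirmed in this section.

```python
def bspline_basis(x, xmin, xmax, s, degree):
    """Evaluate the d = s + degree B-spline basis functions at x.

    Assumes equally spaced knots on [xmin, xmax] with s intervals,
    degree >= 1 and xmin <= x <= xmax.
    """
    h = (xmax - xmin) / s
    # Extended knot vector: `degree` extra knots on each side of the domain.
    knots = [xmin + (k - degree) * h for k in range(s + 2 * degree + 1)]
    # Degree-0 basis: indicator of the knot interval containing x.
    basis = [1.0 if knots[k] <= x < knots[k + 1] else 0.0
             for k in range(len(knots) - 1)]
    # Cox-de Boor recursion, raising the degree one step at a time.
    for q in range(1, degree + 1):
        raised = []
        for k in range(len(basis) - 1):
            left = (x - knots[k]) / (knots[k + q] - knots[k]) * basis[k]
            right = ((knots[k + q + 1] - x)
                     / (knots[k + q + 1] - knots[k + 1]) * basis[k + 1])
            raised.append(left + right)
        basis = raised
    return basis


def spline_value(coefficients, basis):
    """Linear combination of basis functions, as in equation (4)."""
    return sum(c * b for c, b in zip(coefficients, basis))


# Illustration: cubic basis on [0, 1] with 20 intervals, evaluated at x = 0.37.
B = bspline_basis(0.37, 0.0, 1.0, 20, 3)
flat = spline_value([2.0] * len(B), B)
```

At any single point only `degree + 1` basis functions are nonzero and they sum to one, so equal coefficients reproduce a constant function.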

The B-splines form a local basis since the functions 
$B_m$ are only positive within an area spanned by $l + 2$ 
knots. This property is essential for the construction of 
the smoothness penalty for P-splines. The estimation of 
$f_j(x_j)$ is thus reduced to the estimation of the vector of 
unknown regression coefficients $\xi_j = (\xi_{j,1}, \ldots, \xi_{j,d})'$ 
from the data. An essential factor in the estimation procedure
is the choice of the number of knots. We chose a 
moderately large number of equally spaced knots (20), 
as suggested by Eilers and Marx [20] to ensure enough 
flexibility to capture the variability of the data. In the 
Bayesian approach, penalized splines are introduced by 
replacing the difference penalties with their stochastic 
analogues, i.e., first or second order random walk priors 
for the regression coefficients. A first order random walk 
prior for equidistant knots is given by: 

$$\xi_{j,m} = \xi_{j,m-1} + u_{j,m}, \quad m = 2, \ldots, d, \qquad (5)$$

and a second order random walk for equidistant knots 
by: 

$$\xi_{j,m} = 2\xi_{j,m-1} - \xi_{j,m-2} + u_{j,m}, \quad m = 3, \ldots, d, \qquad (6)$$

where $u_{j,m} \sim N(0, \tau_j^2)$ are Gaussian errors. Diffuse priors 
$\xi_{j,1} \propto const$, or $\xi_{j,1}$ and $\xi_{j,2} \propto const$, are chosen as initial 
values, respectively. The conditional distribution of the regression
parameter $\xi_{j,m}$ for a first order random walk is 
defined as: 

$$\xi_{j,m} \mid \xi_{j,m-1} \sim N(\xi_{j,m-1}, \tau_j^2), \qquad (7)$$

and for a second order random walk as: 

$$\xi_{j,m} \mid \xi_{j,m-1}, \xi_{j,m-2} \sim N(2\xi_{j,m-1} - \xi_{j,m-2}, \tau_j^2). \qquad (8)$$

The first order random walk induces a constant trend 
for the conditional expectation of $\xi_{j,m}$ given $\xi_{j,m-1}$, and a 
second order random walk results in a linear trend depending
on the two previous values $\xi_{j,m-1}$ and $\xi_{j,m-2}$.

The joint distribution of the regression parameters $\xi_j = (\xi_{j,1}, \ldots, \xi_{j,d})'$
is computed as a product of the conditional
densities defined by the random walk priors. The 
general form of the prior for $\xi_j$ is a multivariate Gaussian 
distribution with density: 

$$p(\xi_j \mid \tau_j^2) \propto \exp\left(-\frac{1}{2\tau_j^2}\, \xi_j' K_j \xi_j\right), \qquad (9)$$

where the precision matrix $K_j$ acts as a penalty matrix 
that shrinks parameters towards zero, or penalizes too 
abrupt jumps between neighboring parameters. Since the 
penalty matrix $K_j$ is rank deficient, i.e. $k_j = rank(K_j) < dim(\xi_j) = d_j$,
it follows that the prior for $\xi_j \mid \tau_j^2$ is partially 
improper with Gaussian prior $\xi_j \mid \tau_j^2 \propto N(0; \tau_j^2 K_j^{-})$, where 
$K_j^{-}$ is a generalized inverse of $K_j$. The tradeoff between 
flexibility and smoothness is controlled by the variance 
parameter $\tau_j^2$. A large variance corresponds with a rough 
estimated function, and vice versa. 
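
The penalty matrix implied by the random walk priors can be written as $K = D'D$, where $D$ is the first or second order difference matrix. The sketch below builds $K$ in plain Python and evaluates the quadratic form $\xi' K \xi$ from the exponent of equation (9). The function names are hypothetical, and this is an illustration rather than the BayesX implementation.

```python
def difference_matrix(d, order):
    """Difference matrix of the given order acting on d coefficients."""
    # Start from the identity and take row differences `order` times.
    D = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]
    for _ in range(order):
        D = [[D[i + 1][j] - D[i][j] for j in range(d)]
             for i in range(len(D) - 1)]
    return D


def penalty_matrix(d, order):
    """Random walk penalty K = D'D (order 1 for RW1, order 2 for RW2)."""
    D = difference_matrix(d, order)
    return [[sum(row[i] * row[j] for row in D) for j in range(d)]
            for i in range(d)]


def quadratic_form(xi, K):
    """Penalty term xi' K xi from the exponent of the Gaussian prior."""
    n = len(xi)
    return sum(xi[i] * K[i][j] * xi[j] for i in range(n) for j in range(n))


K1 = penalty_matrix(5, 1)   # first order random walk
K2 = penalty_matrix(5, 2)   # second order random walk
flat_penalty = quadratic_form([3.0] * 5, K1)                    # constant
line_penalty = quadratic_form([0.0, 1.0, 2.0, 3.0, 4.0], K2)    # linear
```

A constant vector is not penalized under the first order walk, and a linear one is not penalized under the second order walk. These unpenalized directions are what make $K_j$ rank deficient, with rank $d$ minus the order of the walk.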

Spatial components 

We use the nearest neighbor Gaussian Markov random 
field model, which is common in spatial statistics, to express
prior knowledge of the structured spatial effects. 
Suppose $s \in \{1, \ldots, S\}$ represents the locations of connected
communities; then the locally dependent prior 
probability spatial structure can be specified as: 

$$f_{str}(s) \mid f_{str}(s'), s' \neq s, \tau_{str}^2 \sim N\left(\frac{1}{N_s} \sum_{s' \in \partial_s} f_{str}(s'), \frac{\tau_{str}^2}{N_s}\right), \qquad (10)$$

where $N_s$ is the number of adjacent spatial units and 
$s' \in \partial_s$ denotes that spatial unit $s'$ is a neighbor of spatial 
unit $s$. Thus, the conditional mean of $f_{str}(s)$ is an 
unweighted average of the function evaluations of neighboring
spatial units. Since only the centroids of communities
(point data) are available, we assume the effect of 
spatial interaction is dependent on the distance between the 
centroids of pairs of communities. To ensure an equal number
of neighbors for each community we chose a neighborhood
structure based on the $k$-th nearest neighbor 
method (where $k$ is the number of neighbors). This approach
results in an asymmetric neighborhood matrix; 
therefore, false symmetry was imposed to ensure a symmetrical
neighborhood structure. As for the continuous 
functions $f_j$, the tradeoff between flexibility and smoothness
is controlled by the variance parameter $\tau_{str}^2$. 
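
The neighborhood construction can be sketched as follows. The exact symmetrization rule is not stated in the text, so the union rule used here (two communities are neighbors if either is among the other's $k$ nearest) is an assumption. The centroid coordinates and function name are hypothetical.

```python
import math


def knn_adjacency(centroids, k):
    """k-nearest-neighbor adjacency (directed) and a symmetrized version."""
    n = len(centroids)
    directed = [[0] * n for _ in range(n)]
    for i in range(n):
        # Sort the other centroids by Euclidean distance from centroid i.
        ranked = sorted(
            (math.dist(centroids[i], centroids[j]), j)
            for j in range(n) if j != i
        )
        for _, j in ranked[:k]:
            directed[i][j] = 1
    # Assumed rule: keep a link if it exists in either direction.
    symmetric = [[1 if directed[i][j] or directed[j][i] else 0
                  for j in range(n)] for i in range(n)]
    return directed, symmetric


# Hypothetical centroids on a line; the last one is far from the rest.
centroids = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
directed, symmetric = knn_adjacency(centroids, k=1)
neighbor_counts = [sum(row) for row in symmetric]
```

Under this assumed rule the neighbor counts $N_s$ are no longer identical across communities once symmetry is imposed, which is why $N_s$ appears explicitly in equation (10).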

For the unstructured spatial effects, we assume that 
the parameters $f_{unstr}(s)$ are i.i.d. Gaussian: 

$$f_{unstr}(s) \mid \tau_{unstr}^2 \sim N(0, \tau_{unstr}^2). \qquad (11)$$



The variance or smoothness parameters
$\tau_j^2$, $j = 1, \ldots, p, str, unstr$, are considered unknown.
Therefore, highly dispersed, but proper, inverse 
Gamma hyperpriors $p(\tau_j^2) \sim IG(a_j, b_j)$ with known 
hyper-parameters $a_j$ and $b_j$ are assigned in the second 
stage of the hierarchy. The corresponding probability 
density function is expressed as: 

$$p(\tau_j^2) \propto (\tau_j^2)^{-a_j - 1} \exp\left(-\frac{b_j}{\tau_j^2}\right). \qquad (12)$$

In this study, we use the standard option hyper-parameters
proposed by Fahrmeir et al [18]: IG 
(a = b = 0.001). 

Bayesian inference 

Bayesian inference stems from the posterior distribution, 
that is, the conditional distribution of the model parameters
given the observed data, $p(\theta \mid Chol^{(R)})$, where $\theta$ 
denotes the vector of all model parameters, $Chol^{(R)}$ the 
data vector, and $p(\cdot)$ the probability density function.
In this study, we use a fully Bayesian inference 
based on analysis of the posterior distribution of the model 
parameters by drawing random samples via MCMC 
simulation techniques. The probability density function 
of the posterior distribution is expressed as: 

$$p(\theta \mid Chol^{(R)}) \propto \prod_{i=1}^{N} L(Chol^{(R)}_i, \eta_i) \times \prod_{j=1}^{p} \left[p(\xi_j \mid \tau_j^2)\, p(\tau_j^2)\right] \times p(f_{str} \mid \tau_{str}^2)\, p(f_{unstr} \mid \tau_{unstr}^2) \times \prod_{j=1}^{r} p(\gamma_j)\, p(\sigma^2), \qquad (13)$$

where $L(\cdot)$ is the likelihood function. The full conditionals
for the variance components $\tau_j^2$, $j = 1, \ldots, p, str, unstr$,
and $\sigma^2$ are inverse Gamma distributions. The full 
conditionals for the fixed parameters $\gamma$, the unknown parameter
vectors $\xi_1, \ldots, \xi_p$, as well as $f_{str}(s)$ and $f_{unstr}(s)$, are 
multivariate Gaussian. A Gibbs sampler was employed for 
MCMC simulations, drawing successively from the full 
conditionals for the variance components and the unknown
parameters. Cholesky decompositions for band 
matrices were used to efficiently draw random samples 
from the full conditionals [22,23]. 
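
One Gibbs step for a variance component can be sketched as below. It assumes the standard conjugate result for this model class: with prior $IG(a, b)$ and the Gaussian prior of equation (9), the full conditional is $IG(a + rank(K)/2,\; b + \xi' K \xi / 2)$. The function names and the small coefficient vector are hypothetical, and this is not the BayesX code.

```python
import random


def sample_inverse_gamma(shape, scale, rng):
    """Draw from IG(shape, scale) as the reciprocal of a Gamma draw."""
    # random.gammavariate takes (shape, scale), so the rate `scale`
    # of the underlying Gamma is passed as 1 / scale.
    return 1.0 / rng.gammavariate(shape, 1.0 / scale)


def update_variance(xi, K, rank_K, a, b, rng):
    """Gibbs update of a variance parameter given coefficients xi."""
    n = len(xi)
    quad = sum(xi[i] * K[i][j] * xi[j] for i in range(n) for j in range(n))
    return sample_inverse_gamma(a + 0.5 * rank_K, b + 0.5 * quad, rng)


rng = random.Random(1)
# First order random walk penalty for three coefficients (rank 2).
K = [[1.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 1.0]]
xi = [0.1, 0.3, 0.2]  # hypothetical current coefficient values
tau2 = update_variance(xi, K, rank_K=2, a=0.001, b=0.001, rng=rng)
```

In a full sampler this step would alternate with Gaussian draws for the coefficient vectors and the fixed effects.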

Model implementation 

The continuous covariates used in this study are proximity
to refuse dumps $d_{dump}$, density of refuse dumps $\rho_{dump}$, 
and proximity to potential cholera reservoirs $d_{reser}$. These 
variables are extracted on a per-community basis via a 
Geographic Information System (GIS). Details of the 
approaches for the calculation of these variables can be 
found in Osei and Duker [11] and Osei et al [12]. The 
spatial locations of the communities are used to model 
the spatial effects. In the Kumasi area, no administrative 
boundaries separate the communities. For 
ease of visualization and interpretation, the centroids of 
the communities are converted to Thiessen polygons 
whose boundaries define the area that is closest to each 
centroid relative to all other centroids. 

In addition, two binary categorical covariates are used: 
presence of slum settlers in a community $c_{slum}$ and population
density $\rho_{pop}$. For communities within which slum 
settlers dwell, $c_{slum} = 1$, otherwise $c_{slum} = 0$. Since 
boundaries of the various communities do not exist, the 
population density could not be quantified as a continuous 
variable. Therefore, we categorized the population density
as moderately populated ($\rho_{pop} = 0$) and densely populated
($\rho_{pop} = 1$). We analyze the following set of models. 



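Model 1: $\eta_i = \rho_{dump}\beta_1 + d_{dump}\beta_2 + d_{reser}\beta_3 + \rho_{pop}\gamma_1 + c_{slum}\gamma_2$

Model 2: $\eta_i = f_1(\rho_{dump}) + f_2(d_{dump}) + f_3(d_{reser}) + \rho_{pop}\gamma_1 + c_{slum}\gamma_2$

Model 3: $\eta_i = f_1(\rho_{dump}) + f_2(d_{dump}) + f_3(d_{reser}) + f_{str}(s) + f_{unstr}(s) + \rho_{pop}\gamma_1 + c_{slum}\gamma_2$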

Model 1 is a strictly linear regression that assumes a 
linear effect of the categorical and continuous covariates. 
Model 2 is an additive model which assumes nonlinear 
functions for the continuous covariates and linear effects 
of the categorical covariates. Model 3 is a geoadditive 
model, which is an extension of Model 2 that incorpo- 
rates both structured and unstructured spatial effects. 

The models were implemented in the public domain 
software BayesX ver 2.0 [24,25]. We used a total 
of 40,000 MCMC iterations and 10,000 burn-in
samples. Since, in general, these random numbers are 
correlated, only every 20th sampled parameter of the 
Markov chain was stored. This yielded 2,000 samples 
for parameter estimation. Convergence checks of the 
MCMC algorithms were based on autocorrelations and 
the sampling paths. 
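
The number of stored samples follows from simple arithmetic on the run settings, but it depends on whether the burn-in is counted within the total number of iterations. That convention is not spelled out here, so the hypothetical helper below shows both readings.

```python
def retained_samples(iterations, burn_in, thin, burn_in_included=True):
    """Number of stored draws after burn-in and thinning.

    burn_in_included=True assumes `iterations` already contains the
    burn-in phase; False assumes the burn-in is run in addition.
    """
    kept = iterations - burn_in if burn_in_included else iterations
    return kept // thin


within_total = retained_samples(40000, 10000, 20, burn_in_included=True)
in_addition = retained_samples(40000, 10000, 20, burn_in_included=False)
```

With these settings the first reading gives 1,500 stored draws and the second gives 2,000, the figure reported in the text.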

We compared the strictly linear models with the additive
models and the geoadditive models using the Deviance
Information Criterion (DIC) values [26]. DIC is a 
Bayesian tool for model checking and comparison, 
where the model with the smallest DIC is preferred. The 
DIC is given by $DIC = \bar{D} + p_D$, where $\bar{D}$ is the posterior 
mean of the deviance, which is a measure of goodness of 
fit, and $p_D$ is the effective number of parameters, which 
is a measure of model complexity and penalizes overfitting.
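
As a worked check of this formula, the short sketch below recomputes the DIC and the differences against the best model from the posterior mean deviance and effective number of parameters reported in Table 1. The variable names are illustrative only.

```python
# (posterior mean deviance, effective number of parameters) from Table 1.
table1 = {
    "Model 1": (37.40, 5.85),
    "Model 2": (32.35, 8.95),
    "Model 3": (10.64, 9.43),
}


def dic(mean_deviance, p_d):
    """Deviance Information Criterion: fit plus complexity penalty."""
    return mean_deviance + p_d


dics = {name: round(dic(d_bar, p_d), 2) for name, (d_bar, p_d) in table1.items()}
best = min(dics, key=dics.get)
# Difference of each model's DIC from the preferred (smallest-DIC) model.
delta = {name: round(value - dics[best], 2) for name, value in dics.items()}
```

The recomputed values agree with the DIC row of Table 1, and the differences of Models 1 and 2 from Model 3 are both far above the threshold of 3 cited in the Results.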

Results 

Model selection 

Model assessment and selection was based on the computed
values for the goodness of fit (see Table 1). Models 
with a smaller DIC value are preferred. Models 
with differences in DIC of less than 3 cannot be 
distinguished, while those between 3 and 7 can be weakly
differentiated [27]. Comparing the goodness of fit of the 
models, Model 3 is the preferred model. Although the 
extension of the basic model (Model 1) to an additive 
model (Model 2) is an improvement, this improvement 
is indistinguishable (DIC = 43.25 in Model 1 versus DIC = 
41.30 in Model 2, ΔDIC = 1.95). The extension of Model 2 
to include structured and unstructured spatial effects in 
Model 3 significantly improved the model (DIC = 20.07 in 
Model 3 versus DIC = 41.30 in Model 2, ΔDIC = 21.23). 
Therefore, subsequent analysis and discussions are based 
on the results of Model 3. 

Table 1 Comparison of model fit using Deviance 
Information Criterion (DIC) 

Model fit      Model 1    Model 2    Model 3
$\bar{D}$      37.40      32.35      10.64
$p_D$          5.85       8.95       9.43
DIC            43.25      41.30      20.07
ΔDIC*          23.18      21.23      Reference

* Difference of Model 3 against Models 1 and 2. 

Fixed and nonlinear effects of covariates 

The purpose of Model 1 has been to investigate the appropriateness
of including nonlinear effects in disease 
modeling. In Model 1, the continuous covariates $d_{dump}$ 
and $d_{reser}$ are observed to have no significant effect on 
$Chol^{(R)}$ (Table 2); relying on the linear model alone would
have led to an erroneous rejection of the significance of
their effect. In Model 3, 
the effects of the categorical covariates, which are assumed 
fixed, are estimated jointly with the continuous and 
spatial covariates. The posterior means and the corresponding
90% credible intervals of the fixed effect parameters
are shown in Table 3. The risk of cholera 
infection is observed to be associated with high population
density and living in slum environments. 
Only a moderate difference occurs between the risk of infection 
in densely populated communities and the risk of infection in 
slum communities: the effect of $\rho_{pop}$ on $Chol^{(R)}$ is 0.32 (0.20 - 
0.44) and the effect of $c_{slum}$ on $Chol^{(R)}$ is 0.28 (0.16 - 
0.40). The nonlinear effects of $\rho_{dump}$, $d_{dump}$, and $d_{reser}$ 
are shown in Figures 2, 3, and 4, respectively. The relationship
between $Chol^{(R)}$ and $\rho_{dump}$ is nonlinear, with an 
expected increasing risk (Figure 2), preceded by approximately
equal risk up to $\rho_{dump} = 1.8$. In other words, the 
risk of cholera infection is equal and lower for communities
with fewer refuse dumps, but increases with increasing
refuse dumps from $\rho_{dump} = 1.8$. For $d_{dump}$, the 
risk of infection remains constant up to approximately 
500 m, and then deviates from linearity with a general 
decreasing trend (Figure 3). The effect of $d_{reser}$ is almost 
linear, with the posterior mean decreasing with increasing
distance (Figure 4). 

Table 2 Estimates of fixed effect parameters based on the 
linear Model 1 

Variable               Mean        Std. error   10%          90%
constant               0.444*      0.213        0.171        0.718
c_slum, γ_1            0.267*      0.098        0.141        0.393
ρ_pop, γ_2             0.344*      0.089        0.230        0.457
ρ_dump, β_1            0.156*      0.039        0.107        0.206
d_dump, β_2            4.99E-05    7.19E-05     -4.40E-05    0.00014
d_reser, β_3           -6.54E-05   6.42E-05     -1.44E-04    1.63E-05

* Significance at p < 0.01. 

Table 3 Estimates of posterior mean and 90% credible 
intervals for the fixed effects for Model 3 

Variable               Mean        Std. error   10%          90%
Constant               0.73*       0.081        0.63         0.83
c_slum, γ_1            0.28*       0.095        0.16         0.40
ρ_pop, γ_2             0.32*       0.092        0.20         0.44

* Significance at p < 0.01. 

Spatial effects 

Figure 5 shows the estimated total spatial effects (left) 
and the corresponding 80% credible-interval posterior 
probability map (right) of cholera risk. Areas shaded 
black show strictly negative credible intervals, while 
white areas depict strictly positive credible intervals, and 
grey indicates areas of non-significant spatial effects. 
There is evidence of significant clustering of cholera, 
with higher cholera risk occurring at the central part, 
and a lower risk occurring at the south-eastern part (the 
periphery) of Kumasi (Figure 5). The unstructured 
spatial effects are dominant over the structured spatial 
effects. This is shown by the higher ratio of variance 
components, $\tau_{unstr}^2 / (\tau_{str}^2 + \tau_{unstr}^2) = 0.64$ (Table 4). 
The lesser variations in the caterpillar plots of Figure 6a 
compared with Figure 6b also confirm that the unstructured
spatial effects are dominant over the structured 
spatial effects. 

Sensitivity analyses 

Since the regression parameters depend on the choice of 
hyper-parameters, we reran the MCMC simulations, 
using Model 3 for simplicity, to investigate the sensitivity 
of our results to different choices of hyper-parameters. 
In particular, the following alternative priors have 
been investigated: IG (a = 0.01, b = 0.01), IG (a = 0.5, 
b = 0.0005) and IG (a = 1, b = 0.005). The first alternative 
and the standard option IG (a = 0.001, b = 0.001) are 
commonly used choices for the variances of random 
effects. The second and third alternatives are suggested 
by Kelsall and Wakefield [28] and Besag and Kooperberg 
[27], respectively. Results of the sensitivity analysis on 
the choice of hyper-parameters a and b are shown in 
Table 4. The four choices of hyper-parameters
yielded similar inferences for the posterior 
means of the fixed parameters. Only minor differences
occur between the variance parameters for the 
nonlinear functions and the spatial effects, suggesting the 
robustness of our choices and indicating that our 
model is not very sensitive to the choice of hyper-parameters. 

Discussion 

This study utilizes a geoadditive modeling approach to develop
a multivariate explanatory model for the risk of 
cholera. We utilize a Bayesian semi-parametric regression
model to elucidate the probability of cholera infection
in relation to associated risk factors, some identified 
from previous studies [11,12]. The geoadditive modeling 
approach is an extension of the GAM which allows the 
inclusion of both structured and unstructured spatial 
effects to account for possible unobserved factors and 
heterogeneity terms. To allow flexibility, the continuous 
covariates are modeled non-parametrically as nonlinear 
functions using P-splines with second-order random 
walk priors, based on contributions by 
Fahrmeir and Lang [29,30] and Fahrmeir et al [18], 
while the categorical covariates are modeled as fixed 
effects. The spatially structured and unstructured effects 
are modeled using Markov random field priors and zero 
mean Gaussian heterogeneity priors, respectively [31]. In 
this modeling approach, fully Bayesian inferences based 
on MCMC simulations are preferred because the functionals
of the posterior can be easily computed, thereby 
easily quantifying the uncertainty in the estimated parameters
[18]. 

The findings of the study show that the risk of cholera 
infection is high amongst inhabitants dwelling in slums. 
The risk of infection is also relatively high in densely 
populated communities. These relationships may exist 
because most communities with slum settlers are 
densely populated. Although cholera is transmitted mainly
through contaminated water or food, poor sanitary 
conditions in the environment enhance its transmission. 
The cholera vibrios can survive and multiply outside the 
human body and can spread rapidly where living conditions
are overcrowded and where there is no safe disposal 
of solid waste, liquid waste, and human feces [3,4]. These 
conditions are mostly met in slum and densely populated
communities in Kumasi. Such high population 
density may result in shorter disease transmission
paths, thus increasing the risk of cholera infection.
Also, inhabitants living in slum areas are 
generally poor, and face problems including access to 
potable water and sanitation. In many cases public utility
providers (e.g. water distribution) legally fail to serve 
these urban poor due to factors regarding land tenure 
system, technical and service regulations, and city development
plans. Most slum settlements are also located at 
low lying areas susceptible to flooding. Unfavorable 
topography, soil, and hydro-geological conditions make it 
difficult to achieve and maintain high sanitation standards
among such inhabitants [10]. 

Table 4 Summary of the sensitivity analysis of the choice of hyper-parameters for Model 3 

                               a = 0.001            a = 0.01           a = 0.5               a = 1
                               b = 0.001            b = 0.01           b = 0.0005            b = 0.005
Spatial effects†
f_str(s), τ²_str               0.02                 0.028              0.004                 0.004
                               (0.0005 - 0.06)      (0.003 - 0.07)     (0.00009 - 0.01)      (0.0006 - 0.0009)
f_unstr(s), τ²_unstr           0.02                 0.031              0.007                 0.0071
                               (0.0009 - 0.057)     (0.005 - 0.056)    (0.0001 - 0.028)      (0.0006 - 0.019)
Smooth functions§
f_1(ρ_dump), τ²_1              0.003                0.006              0.0014                0.002
                               (0.0005 - 0.006)     (0.002 - 0.013)    (0.0002 - 0.003)      (0.0006 - 0.004)
f_2(d_dump), τ²_2              0.003                0.0078             0.0007                0.002
                               (0.0002 - 0.0058)    (0.002 - 0.017)    (0.00008 - 0.0015)    (0.0004 - 0.004)
f_3(d_reser), τ²_3             0.001                0.004              0.0004                0.001
                               (0.0002 - 0.0024)    (0.001 - 0.009)    (0.00006 - 0.0007)    (0.0004 - 0.003)

† Variance components and 90% credible intervals for the spatially structured and unstructured effects; § variance components and 90% credible intervals for the 
nonlinear smooth functions. 

The risk of cholera infection is observed to decrease 
with increasing distance from refuse dumps, inhabitants 
within 500 m of the refuse dumps being the 
most vulnerable. This is consistent with the finding from 
previous studies, in which a quantitative assessment of critical
distance discrimination on experimental buffer zones 
around refuse dumps showed that the optimum spatial 
discrimination of cholera occurs at 500 m away from 
refuse dumps [11]. Therefore, we hypothesize that 
refuse dumps located within 500 m of inhabitants
enhance the risk of cholera infection compared 
with those farther away. The expected decreasing trend of 
$Chol^{(R)}$ for $d_{dump} > 500$ m strengthens
the acceptance of this hypothesis. 
Collectively, the nonlinear effects of ddump and pdump
on Chol(R) suggest that cholera risk is relatively high
among inhabitants who live in close proximity to refuse
dumps and where refuse dumps are numerous. Because of the
poor defecation practices of most inhabitants, the refuse
dumps may contain large amounts of fecal matter. Surface
drainage from such refuse dumps pollutes water sources
with feces, and use of this water perpetuates the
transmission of cholera vibrios. If runoff from waste
dumps during heavy rains serves as the major pathway for
fecal and bacterial contamination of rivers and streams,
then inhabitants living closer to the water bodies into
which this runoff flows are likely to have higher cholera
prevalence than those who live farther away. The observed
decrease in cholera prevalence with increasing distance
from potentially polluted surface water bodies (Figure 4),
and the significant linear relationship between ddump and
dreser (results from preliminary regression analysis:
β = 0.67, R² = 0.34, p < 0.001) support this hypothesis.
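
The kind of preliminary regression reported above can be illustrated with a minimal sketch of an ordinary least squares fit. The function name, the choice of dreser as the response, and the distances below are assumptions made purely for illustration; they are not the study data and will not reproduce the reported β or R².

```python
from math import fsum


def simple_linear_regression(x, y):
    """Ordinary least squares fit of y = alpha + beta * x.

    Returns (alpha, beta, r_squared).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = fsum(x) / n
    mean_y = fsum(y) / n
    # Sums of squares and cross-products about the means
    sxx = fsum((xi - mean_x) ** 2 for xi in x)
    syy = fsum((yi - mean_y) ** 2 for yi in y)
    sxy = fsum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    beta = sxy / sxx                        # slope
    alpha = mean_y - beta * mean_x          # intercept
    r_squared = (sxy * sxy) / (sxx * syy)   # squared Pearson correlation
    return alpha, beta, r_squared


# Hypothetical distances in metres (assumed values, NOT the study data)
ddump = [100.0, 250.0, 400.0, 600.0, 900.0, 1200.0]
dreser = [180.0, 150.0, 420.0, 380.0, 760.0, 700.0]

alpha, beta, r_squared = simple_linear_regression(ddump, dreser)
print(f"beta = {beta:.2f}, R^2 = {r_squared:.2f}")
```

A positive slope with a moderate R² would indicate that communities far from refuse dumps also tend to be far from water bodies, which is the pattern the argument above relies on.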

Cholera is primarily driven by environmental and
socioeconomic factors [3,4]; prior knowledge indicates that
geographically close communities will tend to have similar
relative risks, which implies structured spatial variation
in the relative risk. The structured spatial effects
included in the model are surrogate measures of unobserved
spatially correlated risk factors of cholera. The results
show clear evidence of significant clustering of cholera,
with higher risk in the central part of Kumasi (the Central
Business District) and lower risk in the south-eastern part
(the periphery) (Figure 5). These patterns clearly indicate
possible unobserved risk factors of cholera, which may be
global or local. For example, the increased risk in the
central part of Kumasi may reflect the high daily influx of
traders and civil workers from other communities into the
Central Business District. Such an influx strains existing
sanitation systems, which in turn puts people at increased
risk of cholera. The dominance of the unstructured spatial
effects over the structured spatial effects indicates that
the unobserved risk factors are more local than global. For
instance, household socioeconomic characteristics may cause
such local spatial variation. This points to the need for
further epidemiological research using additional
information at the household spatial scale within the
study area.

Unlike classical modeling approaches, our methodological
concept allows modeling flexibility which can reveal
salient features of the continuous covariates. For
instance, using only the linear model, Model 1, would have
led to an invalid rejection of the significance of some
important risk factors: density of refuse dumps and
proximity to potential cholera reservoirs. Such a modeling
approach helps to establish a better epidemiological
relationship between the disease and the risk factors.
Although the methodological concept is somewhat
mathematically intensive, the availability of the public
domain software BayesX provides opportunities for
nonprogrammers to use these methods.

Limitations of study 

Data limitations have forced this study to be undertaken
within a single-scale framework; therefore, the
significance of scale effects has not been accounted for.
Consequently, possible biases induced by the modifiable
areal unit problem (MAUP) have been ignored. If data at
different spatial scales were available, the possible bias
of MAUP could be evaluated within a multi-scale analysis
framework as exemplified in Odoi et al. [32]. Moreover,
re-aggregating the data to another set of areal units could
assess the possible bias of MAUP [33]. However, this is
impossible owing to the limited availability of higher
resolution data and difficulties in assessing the
associated ecological fallacy. In accordance with the
general rule of practice, the study analyzed aggregated
data using the smallest areal units for which data were
available to ameliorate the effects of aggregation.
Accordingly, statistical inferences in this study are made
at the group level rather than the individual level.
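
The effect of re-aggregation discussed above can be illustrated with a minimal sketch. The community names, case counts, populations, and zonings are all hypothetical; the point is only that the same unit-level data can yield very different zone-level rates depending on how units are grouped.

```python
from collections import defaultdict


def aggregate_rates(cases, population, zone_of):
    """Sum cases and population per zone; return cases per 1,000 people."""
    zone_cases = defaultdict(int)
    zone_pop = defaultdict(int)
    for unit, zone in zone_of.items():
        zone_cases[zone] += cases[unit]
        zone_pop[zone] += population[unit]
    return {z: 1000.0 * zone_cases[z] / zone_pop[z] for z in zone_cases}


# Hypothetical counts for four communities (assumed values, NOT the study data)
cases = {"A": 30, "B": 2, "C": 3, "D": 25}
population = {"A": 1000, "B": 1000, "C": 1000, "D": 1000}

# Two alternative ways of grouping the same four communities into two zones
zoning_1 = {"A": "north", "B": "north", "C": "south", "D": "south"}
zoning_2 = {"A": "west", "D": "west", "B": "east", "C": "east"}

rates_1 = aggregate_rates(cases, population, zoning_1)
rates_2 = aggregate_rates(cases, population, zoning_2)
print(rates_1)  # north and south look similar under this zoning
print(rates_2)  # west and east differ sharply under this zoning
```

In this toy example the first zoning masks the contrast between high-risk and low-risk communities, while the second exposes it, which is the kind of sensitivity a multi-scale or re-aggregation analysis would be designed to detect.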

Also, our choice of neighborhood structure induces an
assumption that all the inhabitants reside at the centroids
of the communities. In reality, the communities have
boundaries whose adjacency reflects the true nature of the
spatial structure. The maps of the spatial effects should
therefore be interpreted with caution, as the spatial
boundaries used are artificial (Thiessen polygons).
Different spatial patterns might be observed if the true
boundaries of the spatial units were available.
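
To make the centroid-based assumption concrete, the sketch below derives a neighborhood structure from centroids alone. It assumes that two communities count as neighbors when their Thiessen polygons share an edge, which for points in general position on an unbounded plane is equivalent to being joined in the Delaunay triangulation. The function names and coordinates are hypothetical, the brute-force search is only suitable for small examples, and polygons clipped to a study-area boundary may give slightly different adjacencies.

```python
from itertools import combinations


def circumcircle(a, b, c, eps=1e-12):
    """Return (center_x, center_y, radius_squared), or None if collinear."""
    (ax, ay), (bx, by), (cx, cy) = a, b, c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < eps:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, (ax - ux) ** 2 + (ay - uy) ** 2


def thiessen_neighbors(points, eps=1e-9):
    """Map each point index to the indices of its Thiessen-polygon neighbors.

    A triangle belongs to the Delaunay triangulation when no other point
    lies strictly inside its circumcircle; its edges then link neighbors.
    """
    neighbors = {i: set() for i in range(len(points))}
    for i, j, k in combinations(range(len(points)), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue
        center_x, center_y, r2 = circle
        empty = all(
            (px - center_x) ** 2 + (py - center_y) ** 2 >= r2 - eps
            for m, (px, py) in enumerate(points)
            if m not in (i, j, k)
        )
        if empty:
            for u, v in ((i, j), (j, k), (i, k)):
                neighbors[u].add(v)
                neighbors[v].add(u)
    return neighbors


# Hypothetical community centroids (assumed coordinates, NOT the study data)
centroids = [(0.0, 0.0), (1.0, 2.0), (2.0, 0.0), (3.0, 2.0)]
adjacency = thiessen_neighbors(centroids)
print(adjacency)
```

Because the adjacency depends only on centroid positions, two communities whose true boundaries touch may not be linked here, and vice versa, which is why the resulting maps call for cautious interpretation.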

Conclusion 

This study applies a Bayesian semi-parametric modeling 
approach to develop an explanatory model of cholera. 
Such flexible modeling approaches allow joint analysis 
of nonlinear effects of continuous covariates, spatially 
structured variation, unstructured heterogeneity, and 
fixed effect covariates. Our model reveals that the risk 
of cholera infection is associated with slum settlements, 
high population density, proximity to and density of 
waste dumps, proximity to potentially polluted rivers 
and streams, as well as possible unobserved risk factors. 
The possible unobserved risk factors are shown by the
distinct spatial patterns exhibited by the spatial effects,
suggesting the need for further epidemiological research.
These findings should serve as novel information to help
health planners and policy makers in making effective
decisions about cholera control measures.